In the United States, our two-party system commonly produces very different policies on crime. These policies affect how many people are arrested and put into our criminal corrections system, and how long they stay there. Being in the corrections system has consequences: while incarcerated, many people's lives are put on hold, and both offenders and their families are affected financially and emotionally. After leaving the corrections system, it is much harder to get a job, housing, and financial aid. The corrections system is also a huge expense for the United States, especially with mass incarceration. Because of this, it is important to look at how criminal policies affect how many people enter our corrections system.
Republicans tend to support more tough-on-crime policies than Democrats do. Many of these policies increase punishments for crimes. Republicans are also more likely to support criminalizing drugs and many juvenile offenses (like curfew violations), which could lead to more arrests in Republican-leaning states. Republicans also tend to favor more relaxed laws on weapons and guns, which could result in fewer arrests for weapon carrying but a greater chance of violent crimes involving weapons. Republicans additionally argue that because Democrats favor more relaxed crime policies, more crime occurs in Democratic areas; this could give Democratic-leaning states higher crime rates. Critics of strict crime policies counter that those policies push more people into the system, increasing their likelihood of reoffending and being arrested again.
In this tutorial, we will do a simple analysis comparing arrest rates to the party lean of each state. Ideally, this would be the start of a much larger analysis of arrest rates that looks at many more variables.
To examine the relationship between a state's party lean and its arrest rates, we will go through the five steps of the data science pipeline:
To complete the tutorial, you will need the following libraries:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import json
import folium
from folium.features import GeoJsonTooltip
import geopandas as gpd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
In this stage, we collect data. There are many ways to do this: you can collect your own data through your own study, scrape data off the web, or download an existing dataset. For this tutorial, we will scrape data on three things: party affiliation of each state, arrest rates for each state, and population of each state. Since party affiliation, population, and policies change over time, it is important that all the data come from the same time period. For this tutorial, we are looking at 2017.
When scraping data off the web, you will have to parse the HTML to put it in a dataframe. I will walk you through three examples of scraping data from three different websites.
We will use:
To put our first dataset into a dataframe, we need to fetch the page and store it with requests.get('url'). We will then use BeautifulSoup to display the HTML contents of the page so we can inspect it and examine the table format. To find the table structure, press Ctrl+F and search for "table".
party_affiliation = requests.get('https://news.gallup.com/poll/226643/2017-party-affiliation-state.aspx')
party_affiliation_soup = BeautifulSoup(party_affiliation.content, 'html.parser')
# party_affiliation_soup
Looking at the table structure, we see that all the data is stored in the table rows (tr). Now we want to collect the table rows in a list.
party_affiliation_elements = party_affiliation_soup.findAll('tr') # This collects all the <tr> elements into a list
# party_affiliation_elements
Examining the elements list above, we see that all the state names are in <th> and the Percent Democrat, Percent Republican, and Party Lean values are in <td>. We can figure out each value's index by counting its <td> position within the <tr> element; the first <td> element is at index 0.
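To make the indexing concrete, here is a toy example (made-up HTML, not the actual Gallup page) showing how the state name is pulled from the <th> and how the n-th <td> in a row is reached:

```python
from bs4 import BeautifulSoup

# Made-up row mimicking the structure described above: state name in <th>,
# percentages and the party lean in <td> cells
html = """
<table>
  <tr>
    <th>Alabama</th>
    <td>35</td><td>50</td><td>15</td><td>85</td><td>Solid Rep</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
row = soup.find('tr')
state = row.findChildren('th')[0].get_text()  # 'Alabama'
lean = row.findChildren('td')[4].get_text()   # the fifth <td> -> 'Solid Rep'
```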
# We want to collect data for 4 columns of the dataframe we are creating
# dictionary we will store the data in after we scrape it
party_affiliation_proto_df = { 'State' : [], 'Percent Democrat' : [], 'Percent Republican' : [], 'Party Lean' : []}
# the data starts at the third index of elements so we will start there
for i in range(3, len(party_affiliation_elements) - 2):
    # print('********************')
    # print(i)
    # print(party_affiliation_elements[i].findChildren("th")[0].get_text())
    # for j in party_affiliation_elements[i].findChildren("td"):
    #     print(j.get_text())
    party_affiliation_proto_df['State'].append(party_affiliation_elements[i].findChildren('th')[0].get_text())
    party_affiliation_proto_df['Percent Democrat'].append(party_affiliation_elements[i].findChildren('td')[0].get_text())
    party_affiliation_proto_df['Percent Republican'].append(party_affiliation_elements[i].findChildren('td')[1].get_text())
    party_affiliation_proto_df['Party Lean'].append(party_affiliation_elements[i].findChildren('td')[4].get_text())
Now we need to sort the proto_df so the states are in alphabetical order; this will make it easier to combine the data from our other sources. The sorting algorithm below is bubble sort. As we sort the states, we must also move each state's other data along with it.
n = len(party_affiliation_proto_df['State'])
for i in range(n - 1):
    for j in range(0, n - i - 1):
        if party_affiliation_proto_df['State'][j] > party_affiliation_proto_df['State'][j + 1]:
            # swap the state names
            j_0 = party_affiliation_proto_df['State'][j]
            j_1 = party_affiliation_proto_df['State'][j + 1]
            party_affiliation_proto_df['State'][j] = j_1
            party_affiliation_proto_df['State'][j + 1] = j_0
            # swap the rest of each state's data along with it
            d_0 = party_affiliation_proto_df['Percent Democrat'][j]
            d_1 = party_affiliation_proto_df['Percent Democrat'][j + 1]
            party_affiliation_proto_df['Percent Democrat'][j] = d_1
            party_affiliation_proto_df['Percent Democrat'][j + 1] = d_0
            r_0 = party_affiliation_proto_df['Percent Republican'][j]
            r_1 = party_affiliation_proto_df['Percent Republican'][j + 1]
            party_affiliation_proto_df['Percent Republican'][j] = r_1
            party_affiliation_proto_df['Percent Republican'][j + 1] = r_0
            p_0 = party_affiliation_proto_df['Party Lean'][j]
            p_1 = party_affiliation_proto_df['Party Lean'][j + 1]
            party_affiliation_proto_df['Party Lean'][j] = p_1
            party_affiliation_proto_df['Party Lean'][j + 1] = p_0
# party_affiliation_proto_df
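As an aside, Python's built-in `sorted` can do the same parallel sort more concisely: compute the alphabetical ordering of the states once, then reorder every column by it. A sketch on a toy dictionary with the same parallel-list shape as `party_affiliation_proto_df`:

```python
# Toy dictionary with the same parallel-list shape as party_affiliation_proto_df
proto = {
    'State': ['Colorado', 'Alabama'],
    'Percent Democrat': ['46', '35'],
}

# Indices that put 'State' in alphabetical order
order = sorted(range(len(proto['State'])), key=lambda i: proto['State'][i])

# Reorder every column by that same index order
proto = {col: [vals[i] for i in order] for col, vals in proto.items()}
```

This keeps each state's data together without hand-writing a swap per column.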
Now we can collect our arrest data from the UCR in a similar way.
arrests = requests.get('https://ucr.fbi.gov/crime-in-the-u.s/2017/crime-in-the-u.s.-2017/topic-pages/tables/table-69')
arrests_soup = BeautifulSoup(arrests.content, 'html.parser')
# arrests_soup
arrests_elements = arrests_soup.findAll('tr')
# arrests_elements
This web page is formatted a little differently: there are two table rows per state. The first row contains the state name and information on juvenile arrests; the second row contains the information for all ages. The information we want is the state name and the all-ages arrest counts. The UCR provides data on many different crimes, so I chose a few that I thought would be interesting to look at through the lens of party affiliation. Many of these crimes relate to tough-on-crime policies, and the political parties have legalized weapons and drugs to different extents. I also looked at embezzlement to see if there could be any difference in white-collar crimes across party affiliations. We can again examine the elements list along with the website to figure out the index of each <td> within a <tr> element.
arrests_proto_df = { 'State2' : [], 'Total' : [], 'Violent' : [], 'Property' : [], 'Drug Abuse' : [], 'Curfew and Loitering' : [],
'Embezzlement' : [], 'Weapons' : [], 'Prostitution and Commercialized Vice' : [], 'Vagrancy' : [], 'Population' : []}
# We will start at index 1, as examining the elements list shows this is where the data starts
for i in range(1, len(arrests_elements)):
    # after examining the data, we see that the all-ages arrest counts are at even indices
    if (i % 2) == 0:
        arrests_proto_df['Total'].append(arrests_elements[i].findChildren('td')[0].get_text().replace("\n", "").replace(",", ""))
        arrests_proto_df['Violent'].append(arrests_elements[i].findChildren('td')[1].get_text().replace("\n", "").replace(",", ""))
        arrests_proto_df['Property'].append(arrests_elements[i].findChildren('td')[2].get_text().replace("\n", "").replace(",", ""))
        arrests_proto_df['Drug Abuse'].append(arrests_elements[i].findChildren('td')[20].get_text().replace("\n", "").replace(",", ""))
        arrests_proto_df['Curfew and Loitering'].append(arrests_elements[i].findChildren('td')[30].get_text().replace("\n", "").replace(",", ""))
        arrests_proto_df['Embezzlement'].append(arrests_elements[i].findChildren('td')[14].get_text().replace("\n", "").replace(",", ""))
        arrests_proto_df['Weapons'].append(arrests_elements[i].findChildren('td')[17].get_text().replace("\n", "").replace(",", ""))
        arrests_proto_df['Prostitution and Commercialized Vice'].append(arrests_elements[i].findChildren('td')[18].get_text().replace("\n", "").replace(",", ""))
        arrests_proto_df['Vagrancy'].append(arrests_elements[i].findChildren('td')[27].get_text().replace("\n", "").replace(",", ""))
    else:
        arrests_proto_df['State2'].append(arrests_elements[i].findChildren('th')[0].get_text().replace("\n", ""))
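The repeated `.replace("\n", "").replace(",", "")` chain can be pulled into a small helper to keep the loop readable. This is a refactoring suggestion, not part of the original notebook:

```python
def clean_number(text):
    """Strip newlines and thousands separators from a scraped table cell."""
    return text.replace("\n", "").replace(",", "")

# e.g. each append line would become:
# arrests_proto_df['Total'].append(clean_number(arrests_elements[i].findChildren('td')[0].get_text()))
result = clean_number("1,093,363\n")  # -> "1093363"
```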
We again follow a similar process to get the population of each state from our third website.
population = requests.get('https://www.newgeography.com/content/005837-the-migration-millions-2017-state-population-estimates')
population_soup = BeautifulSoup(population.content, 'html.parser')
# population_soup
population_elements = population_soup.findAll('tr')
# population_elements
Upon examining the data, we see that this website is structured similarly to the first one.
# upon examining the elements list, we see that the state data starts at index 4
for i in range(4, len(population_elements) - 1):
    # print('********************')
    arrests_proto_df['Population'].append(population_elements[i].findChildren('td')[2].get_text().replace(",", ""))
    # print(population_elements[i].findChildren('td')[0].get_text())
    # print(population_elements[i].findChildren('td')[2].get_text())
# arrests_proto_df
In this stage, we put the data into a dataframe, then perform any data transformations and cleaning we need. Usually, this is the stage where you deal with missing information; however, our data has no missing values. Here is an article that discusses different ways to handle missing data: https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e
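For datasets that do have gaps, pandas provides `dropna` and `fillna`. A minimal sketch on toy data (our real dataframe does not need this step):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing arrest count
toy = pd.DataFrame({'Total': [100, np.nan, 300],
                    'Population': [1000, 2000, 3000]})

dropped = toy.dropna()                    # drop rows containing missing values
filled = toy.fillna(toy['Total'].mean())  # or impute with the column mean (200.0)
```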
# combines the two proto_df together
proto_df = dict(party_affiliation_proto_df, **arrests_proto_df)
# puts the proto_df in a pandas dataframe
df = pd.DataFrame.from_dict(proto_df)
df.head(51)
| | State | Percent Democrat | Percent Republican | Party Lean | State2 | Total | Violent | Property | Drug Abuse | Curfew and Loitering | Embezzlement | Weapons | Prostitution and Commercialized Vice | Vagrancy | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 35 | 50 | Solid Rep | ALABAMA | 153285 | 6005 | 18366 | 11267 | 0 | 161 | 1963 | 2 | 56 | 4874747 |
| 1 | Alaska | 31 | 52 | Solid Rep | ALASKA | 29152 | 2374 | 3695 | 1004 | 0 | 77 | 369 | 5 | 2 | 739795 |
| 2 | Arizona | 40 | 42 | Competitive | ARIZONA | 277698 | 12832 | 34919 | 33230 | 1126 | 457 | 3090 | 303 | 616 | 7016270 |
| 3 | Arkansas | 36 | 45 | Lean Rep | ARKANSAS | 123971 | 4715 | 12355 | 17515 | 310 | 54 | 1106 | 172 | 315 | 3004279 |
| 4 | California | 51 | 30 | Solid Dem | CALIFORNIA | 1093363 | 109117 | 105757 | 212025 | 998 | 990 | 28547 | 7056 | 6773 | 39536653 |
| 5 | Colorado | 46 | 37 | Lean Dem | COLORADO | 234409 | 7721 | 27157 | 16626 | 1052 | 126 | 2309 | 521 | 596 | 5607154 |
| 6 | Connecticut | 51 | 32 | Solid Dem | CONNECTICUT | 100211 | 3844 | 14074 | 9174 | 3 | 144 | 1242 | 183 | 24 | 3588184 |
| 7 | Delaware | 45 | 33 | Solid Dem | DELAWARE | 31549 | 1989 | 6156 | 4015 | 33 | 221 | 338 | 81 | 226 | 961939 |
| 8 | District of Columbia | 70 | 11 | Solid Dem | DISTRICT OF COLUMBIA5 | 17986 | 133 | 101 | 237 | 0 | 0 | 47 | 0 | 10 | 693972 |
| 9 | Florida | 42 | 39 | Competitive | FLORIDA6, 7 | 713085 | 36251 | 91063 | 124487 | 5 | 1057 | 6306 | 2468 | 0 | 20984400 |
| 10 | Georgia | 42 | 40 | Competitive | GEORGIA | 230640 | 10843 | 31080 | 40608 | 331 | 292 | 3317 | 379 | 1285 | 10429379 |
| 11 | Hawaii | 50 | 28 | Solid Dem | HAWAII | 35618 | 1227 | 3748 | 2724 | 217 | 11 | 247 | 120 | 0 | 1427538 |
| 12 | Idaho | 31 | 53 | Solid Rep | IDAHO | 51686 | 1441 | 5192 | 8432 | 137 | 43 | 276 | 13 | 15 | 1716943 |
| 13 | Illinois | 50 | 33 | Solid Dem | ILLINOIS6 | 64552 | 3823 | 11653 | 10915 | 51 | 1 | 4290 | 141 | 18 | 12802023 |
| 14 | Indiana | 41 | 43 | Competitive | INDIANA | 147199 | 8345 | 17546 | 26364 | 210 | 266 | 2283 | 305 | 347 | 6666818 |
| 15 | Iowa | 42 | 42 | Competitive | IOWA | 94485 | 4709 | 12280 | 9645 | 252 | 79 | 900 | 46 | 17 | 3145711 |
| 16 | Kansas | 34 | 48 | Solid Rep | KANSAS | 61144 | 2198 | 4958 | 9594 | 0 | 111 | 742 | 127 | 0 | 2913123 |
| 17 | Kentucky | 41 | 45 | Competitive | KENTUCKY | 219872 | 3398 | 17117 | 26397 | 2 | 390 | 967 | 166 | 70 | 4454189 |
| 18 | Louisiana | 40 | 43 | Competitive | LOUISIANA | 166867 | 9494 | 29489 | 25770 | 246 | 146 | 3234 | 412 | 243 | 4684333 |
| 19 | Maine | 47 | 39 | Lean Dem | MAINE | 40675 | 780 | 5578 | 3409 | 17 | 41 | 130 | 119 | 0 | 1335907 |
| 20 | Maryland | 56 | 28 | Solid Dem | MARYLAND | 165877 | 9692 | 21726 | 28992 | 65 | 77 | 3215 | 748 | 82 | 6052177 |
| 21 | Massachusetts | 57 | 26 | Solid Dem | MASSACHUSETTS | 118676 | 9257 | 13671 | 9791 | 2 | 122 | 1328 | 530 | 8 | 6859819 |
| 22 | Michigan | 45 | 38 | Lean Dem | MICHIGAN | 244417 | 11798 | 24692 | 33090 | 319 | 1174 | 4590 | 233 | 139 | 9962311 |
| 23 | Minnesota | 47 | 37 | Solid Dem | MINNESOTA | 143702 | 5769 | 26025 | 19281 | 869 | 23 | 2208 | 359 | 42 | 5576606 |
| 24 | Mississippi | 38 | 45 | Lean Rep | MISSISSIPPI | 78239 | 1492 | 8677 | 10268 | 90 | 383 | 1110 | 25 | 105 | 2984100 |
| 25 | Missouri | 38 | 45 | Lean Rep | MISSOURI | 228042 | 9371 | 31387 | 39979 | 644 | 290 | 3537 | 368 | 1096 | 6113532 |
| 26 | Montana | 37 | 51 | Solid Rep | MONTANA | 30824 | 1262 | 4965 | 2872 | 306 | 30 | 85 | 9 | 10 | 1050493 |
| 27 | Nebraska | 35 | 50 | Solid Rep | NEBRASKA | 5708 | 161 | 523 | 843 | 16 | 3 | 65 | 1 | 0 | 1920076 |
| 28 | Nevada | 42 | 39 | Competitive | NEVADA | 112004 | 6506 | 8946 | 8923 | 412 | 201 | 1885 | 2532 | 1648 | 2998039 |
| 29 | New Hampshire | 43 | 40 | Competitive | NEW HAMPSHIRE | 47516 | 896 | 3642 | 7656 | 11 | 120 | 139 | 79 | 82 | 1342795 |
| 30 | New Jersey | 48 | 33 | Solid Dem | NEW JERSEY | 281631 | 8577 | 24807 | 61989 | 592 | 269 | 3838 | 900 | 299 | 9005644 |
| 31 | New Mexico | 48 | 34 | Solid Dem | NEW MEXICO | 40552 | 2708 | 4340 | 2377 | 0 | 74 | 179 | 35 | 64 | 2088070 |
| 32 | New York | 52 | 29 | Solid Dem | NEW YORK6 | 259257 | 12440 | 43759 | 69904 | 0 | 48 | 3468 | 672 | 618 | 19849399 |
| 33 | North Carolina | 44 | 39 | Competitive | NORTH CAROLINA | 240365 | 11630 | 34918 | 25902 | 18 | 1185 | 5147 | 374 | 5 | 10273419 |
| 34 | North Dakota | 28 | 56 | Solid Rep | NORTH DAKOTA | 39944 | 855 | 3787 | 5646 | 107 | 53 | 313 | 38 | 7 | 755393 |
| 35 | Ohio | 41 | 42 | Competitive | OHIO | 224423 | 8084 | 33068 | 37022 | 542 | 21 | 3681 | 1122 | 208 | 11658609 |
| 36 | Oklahoma | 35 | 49 | Solid Rep | OKLAHOMA | 106731 | 4820 | 14352 | 20024 | 862 | 408 | 2207 | 15 | 58 | 3930864 |
| 37 | Oregon | 49 | 36 | Solid Dem | OREGON | 124851 | 3687 | 17578 | 15682 | 439 | 49 | 2077 | 364 | 22 | 4142776 |
| 38 | Pennsylvania | 46 | 41 | Competitive | PENNSYLVANIA | 372570 | 20385 | 50437 | 63306 | 6723 | 478 | 5056 | 1755 | 298 | 12805537 |
| 39 | Rhode Island | 48 | 27 | Solid Dem | RHODE ISLAND | 22707 | 864 | 2458 | 1792 | 11 | 102 | 379 | 71 | 1 | 1059639 |
| 40 | South Carolina | 37 | 47 | Solid Rep | SOUTH CAROLINA | 153573 | 6500 | 23848 | 32266 | 32 | 358 | 2247 | 381 | 484 | 5024369 |
| 41 | South Dakota | 35 | 52 | Solid Rep | SOUTH DAKOTA | 63625 | 1748 | 3283 | 8900 | 171 | 37 | 334 | 33 | 565 | 869666 |
| 42 | Tennessee | 35 | 47 | Solid Rep | TENNESSEE | 350912 | 14682 | 38394 | 47826 | 1214 | 664 | 3081 | 955 | 5 | 6715984 |
| 43 | Texas | 38 | 41 | Competitive | TEXAS | 745719 | 30372 | 78504 | 136796 | 2301 | 794 | 12954 | 4482 | 872 | 28304596 |
| 44 | Utah | 29 | 56 | Solid Rep | UTAH | 104741 | 2252 | 14526 | 17556 | 356 | 32 | 896 | 379 | 38 | 3101833 |
| 45 | Vermont | 52 | 30 | Solid Dem | VERMONT | 13969 | 664 | 1683 | 1137 | 0 | 43 | 28 | 4 | 0 | 623657 |
| 46 | Virginia | 45 | 38 | Lean Dem | VIRGINIA | 258878 | 7068 | 26582 | 42060 | 741 | 1371 | 4099 | 617 | 177 | 8470020 |
| 47 | Washington | 49 | 34 | Solid Dem | WASHINGTON | 173344 | 8472 | 27055 | 12027 | 0 | 53 | 1793 | 650 | 124 | 7405743 |
| 48 | West Virginia | 40 | 44 | Competitive | WEST VIRGINIA | 39296 | 1832 | 5077 | 7277 | 9 | 116 | 342 | 109 | 27 | 1815857 |
| 49 | Wisconsin | 43 | 41 | Competitive | WISCONSIN | 252142 | 8023 | 29429 | 30781 | 1767 | 325 | 3449 | 451 | 809 | 5795483 |
| 50 | Wyoming | 27 | 56 | Solid Rep | WYOMING | 27914 | 612 | 2374 | 4612 | 90 | 18 | 78 | 48 | 36 | 579315 |
Above, we can see that the data matched up by comparing the State and State2 columns. Now we can drop the State2 column, since we don't need it anymore.
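Rather than eyeballing all 51 rows, a quick programmatic check can confirm the two columns line up. Note that some State2 values carry trailing footnote markers (e.g. "FLORIDA6, 7"), so we strip trailing digits, commas, and spaces before comparing. A sketch on toy data mimicking the two columns:

```python
import pandas as pd

# Toy frame mimicking the State/State2 pair, including a footnote marker
toy = pd.DataFrame({'State': ['Alabama', 'Florida'],
                    'State2': ['ALABAMA', 'FLORIDA6, 7']})

# Strip trailing footnote digits/commas/spaces, then compare case-insensitively
cleaned = toy['State2'].str.replace(r'[\d,\s]+$', '', regex=True)
aligned = (cleaned.str.upper() == toy['State'].str.upper()).all()
```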
df = df.drop(['State2'], axis=1)
Here we need to make all our number variables numeric so we can work with them later (math operations, graphs, and machine learning algorithms). We will also convert the raw arrest counts into arrests per 100 residents by dividing each arrest count by the state's population and multiplying by 100. This allows us to compare between states.
df['Population'] = pd.to_numeric(df['Population'])
df['Percent Democrat'] = pd.to_numeric(df['Percent Democrat'])
df['Percent Republican'] = pd.to_numeric(df['Percent Republican'])

# convert each arrest count to arrests per 100 residents
arrest_columns = ['Total', 'Violent', 'Property', 'Drug Abuse', 'Curfew and Loitering',
                  'Embezzlement', 'Weapons', 'Prostitution and Commercialized Vice', 'Vagrancy']
for column in arrest_columns:
    df[column] = pd.to_numeric(df[column]) / df['Population'] * 100
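As a sanity check on the conversion: Alabama has 153,285 total arrests and 4,874,747 residents, which should come out near 3.14 arrests per 100 residents, matching the first row of the table below. A toy version of the same arithmetic:

```python
import pandas as pd

# Alabama's raw figures from the scraped data
toy = pd.DataFrame({'Total': [153285], 'Population': [4874747]})

# arrests per 100 residents
toy['Total'] = toy['Total'] / toy['Population'] * 100
rate = toy['Total'].iloc[0]  # roughly 3.14
```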
df
| | State | Percent Democrat | Percent Republican | Party Lean | Total | Violent | Property | Drug Abuse | Curfew and Loitering | Embezzlement | Weapons | Prostitution and Commercialized Vice | Vagrancy | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 35 | 50 | Solid Rep | 3.144471 | 0.123186 | 0.376758 | 0.231130 | 0.000000 | 0.003303 | 0.040269 | 0.000041 | 0.001149 | 4874747 |
| 1 | Alaska | 31 | 52 | Solid Rep | 3.940551 | 0.320900 | 0.499463 | 0.135713 | 0.000000 | 0.010408 | 0.049879 | 0.000676 | 0.000270 | 739795 |
| 2 | Arizona | 40 | 42 | Competitive | 3.957915 | 0.182889 | 0.497686 | 0.473613 | 0.016048 | 0.006513 | 0.044040 | 0.004319 | 0.008780 | 7016270 |
| 3 | Arkansas | 36 | 45 | Lean Rep | 4.126481 | 0.156943 | 0.411247 | 0.583002 | 0.010319 | 0.001797 | 0.036814 | 0.005725 | 0.010485 | 3004279 |
| 4 | California | 51 | 30 | Solid Dem | 2.765441 | 0.275989 | 0.267491 | 0.536275 | 0.002524 | 0.002504 | 0.072204 | 0.017847 | 0.017131 | 39536653 |
| 5 | Colorado | 46 | 37 | Lean Dem | 4.180534 | 0.137699 | 0.484328 | 0.296514 | 0.018762 | 0.002247 | 0.041180 | 0.009292 | 0.010629 | 5607154 |
| 6 | Connecticut | 51 | 32 | Solid Dem | 2.792805 | 0.107129 | 0.392232 | 0.255673 | 0.000084 | 0.004013 | 0.034614 | 0.005100 | 0.000669 | 3588184 |
| 7 | Delaware | 45 | 33 | Solid Dem | 3.279730 | 0.206770 | 0.639957 | 0.417386 | 0.003431 | 0.022974 | 0.035137 | 0.008420 | 0.023494 | 961939 |
| 8 | District of Columbia | 70 | 11 | Solid Dem | 2.591747 | 0.019165 | 0.014554 | 0.034151 | 0.000000 | 0.000000 | 0.006773 | 0.000000 | 0.001441 | 693972 |
| 9 | Florida | 42 | 39 | Competitive | 3.398167 | 0.172752 | 0.433956 | 0.593236 | 0.000024 | 0.005037 | 0.030051 | 0.011761 | 0.000000 | 20984400 |
| 10 | Georgia | 42 | 40 | Competitive | 2.211445 | 0.103966 | 0.298004 | 0.389362 | 0.003174 | 0.002800 | 0.031804 | 0.003634 | 0.012321 | 10429379 |
| 11 | Hawaii | 50 | 28 | Solid Dem | 2.495065 | 0.085952 | 0.262550 | 0.190818 | 0.015201 | 0.000771 | 0.017303 | 0.008406 | 0.000000 | 1427538 |
| 12 | Idaho | 31 | 53 | Solid Rep | 3.010350 | 0.083928 | 0.302398 | 0.491105 | 0.007979 | 0.002504 | 0.016075 | 0.000757 | 0.000874 | 1716943 |
| 13 | Illinois | 50 | 33 | Solid Dem | 0.504233 | 0.029862 | 0.091025 | 0.085260 | 0.000398 | 0.000008 | 0.033510 | 0.001101 | 0.000141 | 12802023 |
| 14 | Indiana | 41 | 43 | Competitive | 2.207935 | 0.125172 | 0.263184 | 0.395451 | 0.003150 | 0.003990 | 0.034244 | 0.004575 | 0.005205 | 6666818 |
| 15 | Iowa | 42 | 42 | Competitive | 3.003613 | 0.149696 | 0.390373 | 0.306608 | 0.008011 | 0.002511 | 0.028610 | 0.001462 | 0.000540 | 3145711 |
| 16 | Kansas | 34 | 48 | Solid Rep | 2.098916 | 0.075452 | 0.170195 | 0.329337 | 0.000000 | 0.003810 | 0.025471 | 0.004360 | 0.000000 | 2913123 |
| 17 | Kentucky | 41 | 45 | Competitive | 4.936297 | 0.076288 | 0.384290 | 0.592633 | 0.000045 | 0.008756 | 0.021710 | 0.003727 | 0.001572 | 4454189 |
| 18 | Louisiana | 40 | 43 | Competitive | 3.562236 | 0.202676 | 0.629524 | 0.550132 | 0.005252 | 0.003117 | 0.069039 | 0.008795 | 0.005188 | 4684333 |
| 19 | Maine | 47 | 39 | Lean Dem | 3.044748 | 0.058387 | 0.417544 | 0.255182 | 0.001273 | 0.003069 | 0.009731 | 0.008908 | 0.000000 | 1335907 |
| 20 | Maryland | 56 | 28 | Solid Dem | 2.740782 | 0.160141 | 0.358978 | 0.479034 | 0.001074 | 0.001272 | 0.053121 | 0.012359 | 0.001355 | 6052177 |
| 21 | Massachusetts | 57 | 26 | Solid Dem | 1.730016 | 0.134945 | 0.199291 | 0.142730 | 0.000029 | 0.001778 | 0.019359 | 0.007726 | 0.000117 | 6859819 |
| 22 | Michigan | 45 | 38 | Lean Dem | 2.453417 | 0.118426 | 0.247854 | 0.332152 | 0.003202 | 0.011784 | 0.046074 | 0.002339 | 0.001395 | 9962311 |
| 23 | Minnesota | 47 | 37 | Solid Dem | 2.576872 | 0.103450 | 0.466682 | 0.345748 | 0.015583 | 0.000412 | 0.039594 | 0.006438 | 0.000753 | 5576606 |
| 24 | Mississippi | 38 | 45 | Lean Rep | 2.621863 | 0.049998 | 0.290774 | 0.344090 | 0.003016 | 0.012835 | 0.037197 | 0.000838 | 0.003519 | 2984100 |
| 25 | Missouri | 38 | 45 | Lean Rep | 3.730119 | 0.153283 | 0.513402 | 0.653943 | 0.010534 | 0.004744 | 0.057855 | 0.006019 | 0.017927 | 6113532 |
| 26 | Montana | 37 | 51 | Solid Rep | 2.934241 | 0.120134 | 0.472635 | 0.273395 | 0.029129 | 0.002856 | 0.008091 | 0.000857 | 0.000952 | 1050493 |
| 27 | Nebraska | 35 | 50 | Solid Rep | 0.297280 | 0.008385 | 0.027239 | 0.043905 | 0.000833 | 0.000156 | 0.003385 | 0.000052 | 0.000000 | 1920076 |
| 28 | Nevada | 42 | 39 | Competitive | 3.735909 | 0.217009 | 0.298395 | 0.297628 | 0.013742 | 0.006704 | 0.062874 | 0.084455 | 0.054969 | 2998039 |
| 29 | New Hampshire | 43 | 40 | Competitive | 3.538589 | 0.066726 | 0.271225 | 0.570154 | 0.000819 | 0.008937 | 0.010352 | 0.005883 | 0.006107 | 1342795 |
| 30 | New Jersey | 48 | 33 | Solid Dem | 3.127272 | 0.095240 | 0.275461 | 0.688335 | 0.006574 | 0.002987 | 0.042618 | 0.009994 | 0.003320 | 9005644 |
| 31 | New Mexico | 48 | 34 | Solid Dem | 1.942080 | 0.129689 | 0.207847 | 0.113837 | 0.000000 | 0.003544 | 0.008573 | 0.001676 | 0.003065 | 2088070 |
| 32 | New York | 52 | 29 | Solid Dem | 1.306120 | 0.062672 | 0.220455 | 0.352172 | 0.000000 | 0.000242 | 0.017472 | 0.003385 | 0.003113 | 19849399 |
| 33 | North Carolina | 44 | 39 | Competitive | 2.339679 | 0.113205 | 0.339887 | 0.252126 | 0.000175 | 0.011535 | 0.050100 | 0.003640 | 0.000049 | 10273419 |
| 34 | North Dakota | 28 | 56 | Solid Rep | 5.287844 | 0.113186 | 0.501328 | 0.747426 | 0.014165 | 0.007016 | 0.041435 | 0.005030 | 0.000927 | 755393 |
| 35 | Ohio | 41 | 42 | Competitive | 1.924955 | 0.069339 | 0.283636 | 0.317551 | 0.004649 | 0.000180 | 0.031573 | 0.009624 | 0.001784 | 11658609 |
| 36 | Oklahoma | 35 | 49 | Solid Rep | 2.715205 | 0.122619 | 0.365111 | 0.509405 | 0.021929 | 0.010379 | 0.056145 | 0.000382 | 0.001476 | 3930864 |
| 37 | Oregon | 49 | 36 | Solid Dem | 3.013704 | 0.088998 | 0.424305 | 0.378538 | 0.010597 | 0.001183 | 0.050135 | 0.008786 | 0.000531 | 4142776 |
| 38 | Pennsylvania | 46 | 41 | Competitive | 2.909445 | 0.159189 | 0.393869 | 0.494364 | 0.052501 | 0.003733 | 0.039483 | 0.013705 | 0.002327 | 12805537 |
| 39 | Rhode Island | 48 | 27 | Solid Dem | 2.142900 | 0.081537 | 0.231966 | 0.169114 | 0.001038 | 0.009626 | 0.035767 | 0.006700 | 0.000094 | 1059639 |
| 40 | South Carolina | 37 | 47 | Solid Rep | 3.056563 | 0.129369 | 0.474647 | 0.642190 | 0.000637 | 0.007125 | 0.044722 | 0.007583 | 0.009633 | 5024369 |
| 41 | South Dakota | 35 | 52 | Solid Rep | 7.316027 | 0.200997 | 0.377501 | 1.023381 | 0.019663 | 0.004255 | 0.038406 | 0.003795 | 0.064967 | 869666 |
| 42 | Tennessee | 35 | 47 | Solid Rep | 5.225027 | 0.218613 | 0.571681 | 0.712122 | 0.018076 | 0.009887 | 0.045876 | 0.014220 | 0.000074 | 6715984 |
| 43 | Texas | 38 | 41 | Competitive | 2.634622 | 0.107304 | 0.277354 | 0.483300 | 0.008129 | 0.002805 | 0.045766 | 0.015835 | 0.003081 | 28304596 |
| 44 | Utah | 29 | 56 | Solid Rep | 3.376745 | 0.072602 | 0.468304 | 0.565988 | 0.011477 | 0.001032 | 0.028886 | 0.012219 | 0.001225 | 3101833 |
| 45 | Vermont | 52 | 30 | Solid Dem | 2.239853 | 0.106469 | 0.269860 | 0.182312 | 0.000000 | 0.006895 | 0.004490 | 0.000641 | 0.000000 | 623657 |
| 46 | Virginia | 45 | 38 | Lean Dem | 3.056404 | 0.083447 | 0.313836 | 0.496575 | 0.008749 | 0.016187 | 0.048394 | 0.007285 | 0.002090 | 8470020 |
| 47 | Washington | 49 | 34 | Solid Dem | 2.340670 | 0.114398 | 0.365325 | 0.162401 | 0.000000 | 0.000716 | 0.024211 | 0.008777 | 0.001674 | 7405743 |
| 48 | West Virginia | 40 | 44 | Competitive | 2.164047 | 0.100889 | 0.279593 | 0.400747 | 0.000496 | 0.006388 | 0.018834 | 0.006003 | 0.001487 | 1815857 |
| 49 | Wisconsin | 43 | 41 | Competitive | 4.350664 | 0.138435 | 0.507792 | 0.531121 | 0.030489 | 0.005608 | 0.059512 | 0.007782 | 0.013959 | 5795483 |
| 50 | Wyoming | 27 | 56 | Solid Rep | 4.818449 | 0.105642 | 0.409794 | 0.796113 | 0.015536 | 0.003107 | 0.013464 | 0.008286 | 0.006214 | 579315 |
In this stage, we will plot our data in different ways to find patterns that we can form hypotheses about and apply machine learning algorithms to. It is important to use good visuals here; consider things like color, size, and labels.
# To graph maps, we will need a GeoJSON file. For this tutorial, we will download it from here:
# https://public.opendatasoft.com/explore/dataset/georef-united-states-of-america-state-millesime/table/?disjunctive.ste_code&disjunctive.ste_name&sort=year
# This geojson provides geo data for all 50 states and DC
geojson = gpd.read_file(r'georef-united-states-of-america-state-millesime.geojson')
We will now create three maps showing the concentration of Republicans, Democrats, and arrests in each state. This lets us see visually where the concentrations exist.
# Map showing concentration of Democrats
geojson=geojson[['ste_name','geometry']] # only select the data we will be working with (state name and geometry columns)
us_map = folium.Map(location=[40, -96], zoom_start=3,tiles='openstreetmap')
folium.Choropleth(
    geo_data=r'georef-united-states-of-america-state-millesime.geojson',
    data=df,
    columns=['State', 'Percent Democrat'],
    key_on='feature.properties.ste_name',  # the property containing the state name in the geojson
    fill_color='PuBu',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Percent Democrat',  # title of the legend
    line_color='black').add_to(us_map)
us_map